Abstract


Feed-forward layered networks trained on a pattern classification task in which the
number of training patterns in each class is nonuniform bias strongly in favour of
those classes with the largest membership. This is an unfortunate property of networks
when the relative importance of classes with smaller membership is much greater
than that of classes with many training patterns. In addition, there are many
pattern classification tasks where different penalties are associated with misclassifying
a pattern belonging to one class as another class. It is not generally known
how to compensate for such effects in network training. This paper discusses an
analytical regularisation scheme whereby prior expectations of class importance occurring
in the generalisation data, and misclassification costs, may be incorporated
into the training phase, thus compensating for the uneven and unfair class distributions
occurring in the training set. The effects of the proposed scheme on the
feature extraction criterion employed in the hidden layer of the network are discussed.
The results are illustrated by considering a real medical prognosis
problem concerning data collected from head-injured coma patients. Relationships
between least mean square error minimisation and Bayesian minimum risk
estimation are mentioned, and the importance and relevance of input/output coding
schemes for network performance are considered.
1 Introduction


Connectionist  models  based  on  adaptive  layered  networks  have  been  used  with  some  success 
when  operating  as  static  pattern  classifiers  in  problems  as  diverse  as  sonar  [11]  and  radar  [1] 
classification,  speech  recognition  [16]  and  medical  diagnosis  [3].  The  ability  of  feed-forward 
layered networks to perform static pattern discrimination stems from their potential to create
a specific nonlinear transformation into a space spanned by the outputs of the hidden
units in which class separation is easier [19, 15] (these comments will be discussed in more
detail later and mathematically summarised in the appendices). This transformation is
constrained to maximise a feature extraction criterion which may be viewed as a nonlinear
multi-dimensional generalisation of Fisher's Linear Discriminant function [9]. Since
this criterion involves the weighted between class covariance matrix (where the weighting
is determined by the square of the number of patterns in each class), adaptive networks
trained on a 1-from-c classifier problem (for a c class problem) bias strongly in favour of
those classes which have the largest membership in the training data. Thus, in order to
minimise the sum square error over the entire training set, the optimum solution for the
network parameters is such that the network misclassifies patterns in classes with the smallest
representation in favour of those with larger representation in the training set, irrespective
of the frequency of occurrence or relative class importance in actual ‘operation’.

This is an undesirable feature of networks (and many other standard classifiers) in problems
where information on one particular class may be more difficult or expensive to obtain
than on other classes, and where the relative importance of the classes follows a different
distribution from their frequency of occurrence. For instance, in speech recognition the bulk of
the continuous acoustic signal consists of silence whereas the dominant information content
is contained in the subword units (‘phonemes’), whose importance itself differs
from their frequency of occurrence. Another example, which illustrates asymmetric
misclassification costs, is in the problematic realm of medical prognosis: given a feature pattern
as determined from a set of observations on a patient, what are the likely future health
prospects of that patient? Clearly it is more important to diagnose a serious ailment correctly
than to diagnose a minor complaint correctly. However, and more subtly, it
is a more serious error to predict incorrectly that a given patient will die if that patient
would in fact recover, than to predict incorrectly that a patient would survive when he
actually dies, particularly if resources had to be limited to those in most need who would
gain maximum benefit. Thus, the problem is compounded by asymmetric misclassification
costs.

These  are  illustrative  of  real  pattern  classification  problems  where  the  distribution  of 
patterns  amongst  the  different  classes  in  the  training  set  is  nonuniform  and  also  could  follow 
a  different  distribution  to  the  expected  occurrence  or  the  relative  importance  of  the  classes 
in operation. In addition, there may be further prior knowledge which could be used to
associate a penalty with misclassifying each class as any other. In spite of the obvious
practical  relevance  of  such  aspects  in  real-world  pattern  processing  problems,  it  is  an  area 
of  network  research  where  very  little  detailed  analysis  has  been  performed.  One  of  the 
aims  of  this  paper  is  to  create  an  awareness  of  the  existence  of  such  problems  in  the  naive 
application  of  adaptive  networks  to  real  data,  and  how  they  arise.  A  second  aim  is  to 
provide  the  theoretical  justification  behind  our  proposed  solutions  to  the  problems  raised. 

It  is  possible,  of  course,  to  develop  heuristic  methods  which  attempt  to  compensate  for 
some  of  the  mentioned  effects  in  training  adaptive  networks.  For  instance,  the  classes  of 
the training data may be sampled according to a distribution which reflects the importance
or expected frequency of occurrence of patterns in that class and the network subsequently
can be trained on the sampled data. Alternatively, if training proceeds iteratively and
sequentially, the number of iterations in the learning cycle of a network may be varied
for each class or pattern, which would have a similar compensatory effect. Equivalently,
the sum square error minimised by the network during training may be weighted by the
frequency of occurrence of the patterns in each class. It has not been obvious what
effects such methods have on the feature extraction mechanisms employed by
adaptive networks.

In this paper, an analytic regularisation scheme [15] is reviewed and discussed. This
allows effects such as uneven class membership and importance to be compensated for
during training so as to produce a network with the desired characteristics in operation.
In  particular,  a  network  may  be  ‘tuned’  during  the  learning  phase  by  an  appropriate  choice 
of  error  weighting  and  target  coding  by  exploiting  prior  knowledge  specific  to  the  problem 
under  study.  The  effects  of  these  factors  on  the  space  spanned  by  the  outputs  of  the  hidden 
units  will  be  considered  from  the  point  of  view  of  a  generalised  feature  extraction  criterion. 
The  essential  results  are  stated  in  the  following  section  with  relevant  concise  mathematical 
details  contained  in  the  appendices.  The  second  half  of  the  paper  applies  the  results  to  a 
real  medical  prognosis  problem  taken  from  an  analysis  of  1000  patients  suffering  severe  head 
injury  and  contrasts  the  results  obtained  by  a  feed-forward  network  with  those  achieved  by 
various standard pattern classifiers. For this particular problem, one would like to identify,
quite early on in the treatment, those patients likely to require long term intensive care and
therapy. It happens that this class of patient has the smallest frequency of occurrence
and so a simple application of adaptive network techniques would find a solution which
totally misclassifies this class of patient (this will be illustrated). Indeed this is a common
problem with most traditional statistical diagnostic techniques. It will be shown how
it is possible to improve the likelihood of performing a correct prognosis on that class of
patients requiring the greatest long term care by exploiting the results of the next section.
2 Networks, Feature Extraction and Discriminant Analysis


We consider generic feed-forward networks with an arbitrary number of hidden layers (although
only one hidden layer is necessary for approximating a given function mapping
arbitrarily closely [14]) and each hidden node may have a different nonlinearity. Also, the
combination rule transforming patterns from the output of one layer to the input of the next
layer may be the usual scalar product rule as used in traditional multi-layer perceptrons,
although other combination rules can be used [5] without altering our arguments. However,
the transfer functions of the output nodes are restricted to be linear to allow us to
exploit the properties of linear least mean square optimisation methods in the final layer of
weights. This is not a severe restriction. For instance, associative mappings of unscaled
data require output transfer functions which do not restrict the range of possible outputs.
In this paper, network training is viewed as a problem in optimisation by minimising the
total residual between the desired target values and the actual network outputs over the
entire training set. There are other criteria for training adaptive networks which do not
attempt to obtain the best (even locally) minimum error solution. However, the theoretical
basis of least mean square error minimisation allows a rigorous understanding and analysis
of optimum network performance.

The output of the k-th output node of the network may be expressed as

$$ o_k = \lambda_{0k} + \sum_{j=1}^{n_0} \lambda_{jk}\, \phi_j(y_j), \qquad k = 1, 2, \ldots, c \tag{1} $$

where $\lambda_{jk}$ is the connecting weight from the j-th hidden node (of which there are $n_0$) to
the k-th output node (of which there are c), $\lambda_{0k}$ is a bias term attached to the k-th output
node and $\phi_j(y_j)$ is the output from the j-th hidden node in the final hidden layer, which is
a nonlinear function of the scalar input $y_j$. The input $y_j$ is a parameterised function of
the previous layer patterns. For instance, in a multilayer perceptron

$$ y_j = \mu_{0j} + \sum_{i=1}^{n} \mu_{ij}\, z_i, \qquad j = 1, \ldots, n_0 \tag{2} $$

where $\mu_{0j}$ is a bias term, $\mu_{ij}$ is the weight connecting the i-th node of the previous layer
to the j-th node of the current layer, and $z_i$ is the i-th component of the pattern vector
output at the previous layer.
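
As a minimal sketch of equations (1) and (2) for a network with a single hidden layer, the following Python computes the linear outputs for one input pattern. The function name `forward`, the use of `tanh` as the hidden nonlinearity and all numerical values are illustrative assumptions; the text allows any nonlinearity at each hidden node.

```python
import math

def forward(z, mu0, mu, lam0, lam, phi=math.tanh):
    """Linear-output network with one hidden layer.

    z    : input pattern, length n
    mu0  : hidden biases mu_{0j}, length n0
    mu   : mu[i][j] is the weight from input i to hidden node j
    lam0 : output biases lambda_{0k}, length c
    lam  : lam[j][k] is the weight from hidden node j to output k
    phi  : hidden nonlinearity (tanh is an assumed choice)
    """
    n0, c = len(mu0), len(lam0)
    # Equation (2): scalar input to each hidden node.
    y = [mu0[j] + sum(mu[i][j] * z[i] for i in range(len(z))) for j in range(n0)]
    hidden = [phi(v) for v in y]
    # Equation (1): bias plus linear combination of the hidden outputs.
    return [lam0[k] + sum(lam[j][k] * hidden[j] for j in range(n0)) for k in range(c)]

# Two inputs, two hidden nodes, three outputs (values chosen arbitrarily).
outputs = forward(
    z=[0.5, -1.0],
    mu0=[0.0, 0.1],
    mu=[[1.0, -0.5], [0.3, 0.8]],
    lam0=[0.2, 0.0, -0.1],
    lam=[[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
)
```

Because the output nodes are linear, the last line of `forward` is an ordinary affine map of the hidden outputs, which is what allows the final layer to be treated by linear least squares.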

Denoting the actual network output vector of the p-th pattern as $o^p$ and the desired
prototype target pattern as $t^p$, generally one wishes to minimise the error

$$ E = \sum_{p=1}^{P} d_p\, \| t^p - o^p \|^2 \tag{3} $$

where $d_p$ is the scalar weighting associated with the p-th pattern and is usually assumed to
be unity. Since the network output represents a (differentiable) flexible though parameterised
model, training consists of adapting the parameters of the network by any suitable
optimisation strategy [21] to minimise this residual error. Generalisation ability depends
on the network being complex enough (as determined by the number of hidden units in this
case) to model the structure in the data adequately without being too complex, which would
allow the network to fit the superimposed noise on the data. Previous work [5] has made
explicit this relationship between training and curve fitting, and between generalisation and
interpolation to the fitted surface.
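
The weighted error of equation (3) can be sketched in a few lines of Python. The function name `weighted_error` and the toy data are assumptions made for illustration: three patterns belong to class 1, one to class 2, and the hypothetical network always answers "class 1".

```python
def weighted_error(targets, outputs, d=None):
    """Equation (3): sum over patterns of d_p * ||t^p - o^p||^2."""
    if d is None:
        d = [1.0] * len(targets)  # the usual assumption d_p = 1
    return sum(
        dp * sum((t - o) ** 2 for t, o in zip(tp, op))
        for dp, tp, op in zip(d, targets, outputs)
    )

targets = [[1, 0], [1, 0], [1, 0], [0, 1]]
outputs = [[1, 0], [1, 0], [1, 0], [1, 0]]  # always answers "class 1"

uniform = weighted_error(targets, outputs)
# Weights chosen so that each class carries the same total weight.
reweighted = weighted_error(targets, outputs, d=[2 / 3, 2 / 3, 2 / 3, 2.0])
```

With unit weights the single misclassified pattern costs 2; with the rebalanced weights the same mistake costs 4, so ignoring the small class becomes more expensive for the optimiser.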

Since the output transfer functions are linear, the transformation performed by the final
layer of weights may be inverted by pseudo-inverse techniques to determine the optimum
distribution of patterns at the output of the $n_0$ hidden nodes which minimises the error.
One finds that the choice of weight parameters in the first set of layers of the network
distorts the training patterns by a nonlinear transformation into the space spanned by the
outputs of the final hidden layer. This distortion is performed so as to maximise a specific
feature extraction criterion,

$$ J = \mathrm{Tr}\{ S_B S_T^{+} \} \tag{4} $$

where $\mathrm{Tr}\{A\}$ is the trace and $A^{+}$ is the Moore-Penrose pseudo-inverse of matrix $A$. The
mathematical forms of the matrices $S_B$, $S_T$ are given in Appendix A. Their precise interpretation
depends on the specific output target coding scheme and on the error weighting
factors, $d_p$. However, they may be considered as the between class and total covariance
matrices of the nonlinearly transformed patterns at the outputs of the hidden layer. Thus,
the optimum network solution is obtained by forcing the weights in the primary layers of a
network to produce a transformation of patterns into the space spanned by the outputs of
the final layer of hidden units which maximises the separability of the classes (through $S_B$)
whilst maintaining overall normalisation (through $S_T$). This transformation of patterns
is equivalent to an optimum feature extraction criterion (maximising (4)) matched to the
(linear) discrimination process of the final layer weights. In this sense, feed-forward layered
networks operating as classifiers succeed because they perform a specific discriminant
analysis by exploiting subspace methods.


2.1  Specific  coding  scheme. 

Although the generic expressions of the matrices $S_B$, $S_T$ are given in Appendix A, it is
instructive to consider specific forms for different prototype target coding schemes. Consider
a c class problem where it is assumed that there are $n_k$ training patterns in class
$k$, $k = 1, 2, \ldots, c$.


• Example 1: $d_p = 1$ and 1-from-c target coding.

The desired prototype output target values are $t_k = 1$ if the input pattern belongs to
class k and zero otherwise. This is the most common form of assumed target coding
scheme and error weighting used in adaptive network training. The matrices $S_B$,
$S_T$ may be expanded to:

$$ S_T = \frac{1}{P} \sum_{p=1}^{P} (\phi^p - m^H)(\phi^p - m^H)^{*}, \qquad
   S_B = \frac{1}{P^2} \sum_{k=1}^{c} n_k^2\, (m_k^H - m^H)(m_k^H - m^H)^{*} \tag{5} $$

where $t^{*}$ denotes the transpose of vector $t$, $\phi^p$ denotes the output vector of the
final hidden layer for the p-th pattern, $m^H = \sum_{p=1}^{P} \phi^p / P$ is the overall mean of the
training set and $m_k^H = \sum_{\phi^p \in k} \phi^p / n_k$ is the mean over all patterns in class k evaluated
in the space spanned by the outputs of the hidden units. It is assumed that there are
$n_k$ patterns in the k-th class, so that $n_k / P = P_k$ is the prior probability of the k-th
class.

It is clear that the expression for $S_T$ in equation (5) is the total covariance matrix of
patterns $\phi^p$ at the outputs of the hidden layer and $S_B$ is the weighted between class
covariance matrix. The weighting is determined by the square of the number of
patterns in each class in the training set, which skews the feature extraction towards
those classes with more patterns. This indicates why networks trained on classification
problems where there is an uneven distribution of patterns between classes will
bias strongly towards those classes with largest membership. In actual operation
(generalisation mode) classes with smallest membership will tend to be ignored.

• Example 2: Targets weighted by priors.

In this case the desired output prototype target value $t_k$ will be equal to $1/\sqrt{P_k}$ if
the input pattern belongs to class k and zero otherwise. Thus the gain in achieving
correct classification is inversely proportional to the number of samples occurring in the
correct class. The total covariance matrix, $S_T$, remains the same, but $S_B$ becomes

$$ S_B = \sum_{k=1}^{c} P_k\, (m_k^H - m^H)(m_k^H - m^H)^{*} \tag{6} $$

the conventional between class covariance matrix (where the classes are weighted
linearly according to the number of patterns in that class).

• Example 3: $d_p = 1$ and arbitrary loss factors for targets.

The desired prototype target vector for the p-th pattern will have components $t_k$
which represent the loss to be expected from classifying pattern p in class k. Again,
the total covariance matrix remains the same but $S_B$ is substantially modified to

$$ S_B = \frac{1}{P^2} \sum_{j=1}^{c} \left[ \sum_{k=1}^{c} l_{jk}\, n_k\, (m_k^H - m^H) \right] \left[ \sum_{k=1}^{c} l_{jk}\, n_k\, (m_k^H - m^H) \right]^{*} \tag{7} $$

where $l_{jk}$ is the loss incurred by ascribing to class j a pattern belonging to class k.
Note that if $l_{jk} = \delta_{jk}$ then this expression reduces to the usual weighted between class
covariance matrix in (5).

• Example 4: Weight each pattern residual of the training error according to the a
priori class probabilities and the number of patterns in each class according to

$$ d_p = \frac{P(k)}{P_k} \quad \text{for the p-th pattern in the k-th class} $$

where $P(k)$ is the actual class importance or frequency of occurrence in operation,
coded as a probability, and recall that $P_k$ is the prior probability of the k-th class in
the training set. In this case both $S_T$ and $S_B$ change. $S_T$ is independent of the
particular target coding scheme and becomes

$$ S_T = \sum_{k=1}^{c} \frac{P(k)}{n_k} \sum_{\phi^p \in k} (\phi^p - m^H)(\phi^p - m^H)^{*} \tag{8} $$

where the sample based estimate of the population mean $m^H$ now becomes

$$ m^H = \sum_{k=1}^{c} \frac{P(k)}{n_k} \sum_{\phi^p \in k} \phi^p = \sum_{k=1}^{c} P(k)\, m_k^H \tag{9} $$

Thus, equation (8) is a sample based estimate of the expected total covariance matrix.

The form of $S_B$ remains the same as before, except that the weighting factors are
determined by the expected class importance, $P(k)$, instead of the actual class prior
in the training set, $P_k$. Thus, for instance, for the 1-from-c target coding scheme
(Example 1), the weighted between class covariance matrix becomes the sample based
estimate of the expected weighted between class covariance matrix:

$$ S_B = \sum_{k=1}^{c} P(k)^2\, (m_k^H - m^H)(m_k^H - m^H)^{*} \tag{10} $$

where $m^H$ is the expected sample based mean given in (9).

A brief summary of the results obtained in this section is justified. It has been illustrated
that feed-forward deterministic networks perform well as pattern classification
devices because the optimum subspace representation formed by the hidden layer executes
a specific feature extraction criterion allowing an optimised discriminant analysis. The
nature of the transformation is such that, for a 1-from-c target coding scheme, the optimum
network solution is obtained by biasing very strongly in favour of those classes with
largest pattern membership, irrespective of the significance of that class. This is primarily
an assumption that the expected occurrence of patterns in the test set is the same as in
the training set, which is the best assumption one can make without any prior knowledge.
However, it might be known that the occurrence or relative importance of classes in the test
set is not reflected by the frequency of occurrence in the training set. In this case, one
can impose expected class distributions by appropriately weighting the error for patterns in
each class. For instance, if in a c-class classification problem the occurrence of each class
in operation is considered equally likely but the number of patterns in each class in the
training set is distributed with $n_k$ in class k, then one should weight each training pattern
according to

$$ d_p = \frac{1}{c\, P_k} = \frac{P}{c\, n_k} \quad \text{for pattern p in class k.} \tag{11} $$

This will give an equal importance to each class in the test set. In addition one may have
knowledge regarding the relative costs of misclassification. These effects may be accommodated
by incorporating off-diagonal components in the prototype target matrix.
Modifying the error weighting and the target matrix has been shown to work
by influencing the feature extraction criterion, which effectively distorts the between class
and total covariance matrices of the transformed patterns.
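
The pattern weights described above are straightforward to compute from the class counts. The sketch below implements $d_p = P(k)/P_k$ and defaults to equally likely classes; the function name `pattern_weights` and the class sizes are hypothetical.

```python
def pattern_weights(class_counts, importance=None):
    """Weight d_p = P(k) / P_k shared by every pattern in class k.

    class_counts : n_k, number of training patterns in each class
    importance   : P(k), class probabilities expected in operation;
                   defaults to equally likely classes, P(k) = 1/c
    """
    total = sum(class_counts)
    c = len(class_counts)
    if importance is None:
        importance = [1.0 / c] * c
    return [p_op / (n / total) for p_op, n in zip(importance, class_counts)]

counts = [50, 40, 10]  # hypothetical class sizes
d = pattern_weights(counts)
# Total weight carried by each class after reweighting.
class_mass = [dk * n for dk, n in zip(d, counts)]
```

Each class now carries the same total weight in the error, and the weights still sum to the number of training patterns, so the overall scale of the error is unchanged.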


The ‘standard’ model (Example 1) is reproduced in the limit of the expected prior
probabilities being proportional to the numbers in each class. The analysis has indicated
how a combination of gain factors in the prototype target matrix and error weighting by
the inclusion of expected probabilities in the training scheme induces a nonlinear transformation
equivalent to maximising a specific feature extraction criterion which is capable
of compensating for uneven inter-class significance. These points will be illustrated later
with regard to a specific medical prognosis problem. First, however, it will be useful for
subsequent purposes to point out further advantages of training networks by an optimised
least mean square error analysis.


2.2  Networks  and  Probabilities 


It  is  not  widely  appreciated  that  there  are  intimate  links  between  optimally  trained  networks 
and  traditional  Bayesian  inference  with  regard  to  statistical  pattern  recognition.  This 
subsection  states  a  few  properties  which  relate  to  the  particular  feature  extraction  discussed 
in the previous subsection. We discuss how the outputs of a network may be interpreted
as probabilities. This motivates the initialisation of a multilayer perceptron as a statistical
independence model.

First, note that an optimally trained network with linear output units trained on 1-from-c
prototype targets satisfies an interesting property: the outputs of the network for
arbitrary input are guaranteed to sum to unity¹ (see Appendix B). This is interesting if
one wishes to interpret the network outputs as probabilities. However, there is no guarantee
of positivity at all the network outputs, thus invalidating any strict classical probabilistic
interpretation. Although one can introduce a nonlinear normalisation subsequent to the
network output to force the outputs to be all positive and sum to unity [4] regardless of the
method of training the network, we will remain within the confines of our assumed model
and impose structure through training.
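
The sum-to-unity property can be checked numerically in the simplest setting: one hidden unit whose outputs are held fixed, with the linear output weights and biases found by ordinary least squares. This is a sketch under those assumptions, not the general argument of Appendix B; the function names and data are hypothetical.

```python
def fit_linear_outputs(phi, targets):
    """Least-squares fit of o_k = a_k + b_k * phi for each output k."""
    P = len(phi)
    mean_phi = sum(phi) / P
    var_phi = sum((x - mean_phi) ** 2 for x in phi) / P
    params = []
    for k in range(len(targets[0])):
        t_k = [t[k] for t in targets]
        mean_t = sum(t_k) / P
        cov = sum((x - mean_phi) * (t - mean_t) for x, t in zip(phi, t_k)) / P
        b = cov / var_phi
        params.append((mean_t - b * mean_phi, b))  # (bias a_k, weight b_k)
    return params

phi = [0.1, 0.3, 0.2, 0.8, 0.9, 0.5]  # hidden-unit outputs (made up)
targets = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
params = fit_linear_outputs(phi, targets)

def outputs(x):
    """Network outputs for an arbitrary hidden-unit value x."""
    return [a + b * x for a, b in params]
```

For any `x`, `sum(outputs(x))` equals one up to rounding, because the weights sum to zero and the biases sum to one. Individual outputs are not confined to [0, 1]: for example `outputs(5.0)` contains a negative entry, which is the positivity problem noted above.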

¹ Actually a more general sum rule holds: if a set of target vectors satisfies several linear constraints
simultaneously, then so will the general network output.

One may motivate the idea that an appropriately trained network can approximate the
Bayes decision vector (such that the output of each node represents the probability of that
class given the input pattern) by the following example. Consider a network trained on
a finite number, P, of input patterns where there are c prototype target vectors. We also
assume duplicity of class membership: pattern p occurs $n_{pk}$ times in class k. Then the
uniform (i.e. $d_p = 1$) error may be rearranged to give

$$ E = \sum_{p=1}^{P} \sum_{k'=1}^{c} \sum_{k=1}^{c} n_{pk}\, (t^k_{k'} - o^p_{k'})^2 \tag{12} $$

where P is the number of distinct patterns, and $t^k_{k'}$ is the k' component of the k-th prototype
target. The optimum network output which minimises this error is

$$ o^p_{k'} = \frac{\sum_{k=1}^{c} n_{pk}\, t^k_{k'}}{\sum_{l=1}^{c} n_{pl}} \tag{13} $$

which, for a 1-from-c target coding scheme, reduces to $P(k \mid p) = n_{pk} / \sum_{l=1}^{c} n_{pl}$, the probability
of the class k occurring given input pattern p, where the probability density functions
have been approximated by the histograms of the data distribution. For a 1-from-c output
coding scheme, the best network output (even assuming that the network structure was
capable of capturing this relation in principle) attempts a sample estimate of the Bayes
probabilities. This result is a consequence of the coding scheme and least mean square
minimisation. It is not a special property of networks. However, it does justify the
slightly unorthodox approach of setting up a network to perform posterior probability density
estimation based on an assumption of statistical independence, which we now outline.
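
The optimum output (13) is a count-weighted average of the prototype targets, which for 1-from-c targets is just the relative class frequency of the pattern. The sketch below computes it for one pattern and evaluates that pattern's contribution to the error (12); the function names and the counts are hypothetical.

```python
def optimum_outputs(counts, prototypes):
    """Equation (13) for a single pattern p.

    counts     : counts[k] = n_pk, times the pattern was seen in class k
    prototypes : prototypes[k] = target vector t^k for class k
    """
    total = sum(counts)
    dim = len(prototypes[0])
    return [
        sum(n * t[kp] for n, t in zip(counts, prototypes)) / total
        for kp in range(dim)
    ]

def pattern_error(output, counts, prototypes):
    """Contribution of one pattern to the error in equation (12)."""
    return sum(
        n * sum((t[kp] - output[kp]) ** 2 for kp in range(len(output)))
        for n, t in zip(counts, prototypes)
    )

one_from_c = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
counts = [6, 3, 1]  # hypothetical n_pk for one pattern
best = optimum_outputs(counts, one_from_c)
```

Here `best` is approximately `[0.6, 0.3, 0.1]`, the histogram estimate of the class probabilities for this pattern, and any other output vector gives a larger value of `pattern_error`.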


Previous work [18] on the medical data discussed in the next section obtained the best
overall classification performance using a statistical independence model. This model
assumes that the probability of a pattern p occurring in the i-th class is proportional to
the product of the estimates of the marginal densities. Since the data is categorical (the
value of each channel, r, representing one of the R features of the input pattern, is allowed
only one value $y_r$ of a finite number of $Y_r$ values), the probability of the pattern given the
class is given by

$$ P(p \mid i) \propto \left[ \prod_{r=1}^{R} \frac{n_i(y_r) + 1/Y_r}{N_i(r) + 1} \right]^{B} \tag{14} $$

where the value $y_r$ in channel r occurs $n_i(y_r)$ times in class i and $N_i(r)$ is the total number
of times any non-zero value occurs in channel r for class i. Note that $N_i(r)$ is not equal to
the number of patterns in class i due to missing data. B is a smoothing factor [13]. By
Bayes theorem, the probability of the class given the pattern is determined by multiplying
the above probability by the prior probability of the class.
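
A minimal sketch of the independence model (14), followed by the Bayes step, is given below. It assumes every channel of the pattern is observed (the treatment of missing values in a test pattern is not specified here), takes B = 1 by default purely for illustration, and uses made-up counts for two classes and two channels.

```python
import math

def class_conditional(pattern, value_counts, channel_totals, levels, B=1.0):
    """Unnormalised P(p|i) from equation (14) for one class i.

    pattern        : observed value y_r for each channel r
    value_counts   : value_counts[r][y] = n_i(y), occurrences of y in channel r
    channel_totals : channel_totals[r] = N_i(r), non-missing values in channel r
    levels         : levels[r] = Y_r, number of possible values in channel r
    B              : smoothing factor (1.0 is an assumed default)
    """
    log_p = 0.0
    for r, y in enumerate(pattern):
        n = value_counts[r].get(y, 0)
        log_p += math.log((n + 1.0 / levels[r]) / (channel_totals[r] + 1.0))
    return math.exp(B * log_p)

def posterior(pattern, class_stats, priors, levels, B=1.0):
    """Class probabilities given the pattern: prior times (14), normalised."""
    scores = [
        prior * class_conditional(pattern, counts, totals, levels, B)
        for (counts, totals), prior in zip(class_stats, priors)
    ]
    total = sum(scores)
    return [s / total for s in scores]

levels = [2, 3]
class_stats = [
    ([{1: 8, 2: 2}, {1: 5, 2: 4, 3: 1}], [10, 10]),  # class 0 (made up)
    ([{1: 1, 2: 9}, {1: 1, 2: 2, 3: 7}], [10, 10]),  # class 1 (made up)
]
post = posterior((2, 3), class_stats, priors=[0.5, 0.5], levels=levels)
```

The pattern `(2, 3)` is common in class 1 and rare in class 0, so `post` places most of its mass on class 1. The `1/Y_r` and `+1` terms keep the estimate finite when a value was never observed in a class.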

A network which approximates this probabilistic model may be constructed as follows.
Since there are $Y_r$ levels for each feature, we have $n = \sum_{r=1}^{R} Y_r$ inputs in total, with the
first $Y_1$ inputs corresponding to feature 1, the next $Y_2$ to feature 2 and so on. The input
coding for the r-th group is ‘1’ in the position corresponding to the feature value $y_r$ and
zero otherwise. There is a hidden unit for each class, and there are c linear output nodes.
Let s denote the position in the input coding corresponding to the $y_r$-th value of variable
r. Then the weight from input unit s to hidden unit i is given by

$$ \mathrm{weight}(s, i) = B \log \left[ \frac{n_i(y_r) + 1/Y_r}{N_i(r) + 1} \right] \tag{15} $$

and there is a bias term denoted by $\log A$ where A has yet to be specified. It is clear that
using these weights, the scalar product between an input pattern p and these weights gives

$$ z = \sum_{s} \mathrm{weight}(s, i) = \log P(p \mid i) \tag{16} $$

where the sum runs over the input positions s set to ‘1’ by pattern p, and the last equality
follows from (14). Thus for a logistic transfer function of the form $f(x) = 1/[1 + \exp(x)]$
the output of node i may be expressed as

$$ f_i(p) = 1 / [1 + \exp(\log A + \log P(p \mid i))] = 1 / [1 + A P(p \mid i)] \tag{17} $$

Now the easiest way to ensure that the output of the network approximates the probability
of the class given the pattern is to have a single direct connection from each hidden node
to the corresponding output node with a weight given by $-p_i/A$ (all other connection strengths set to
zero) and a bias term given by $p_i/A$, where $p_i$ is the prior probability associated with class
i. Then, if A is sufficiently small we may approximate the output of the hidden node by
$f_i \approx 1 - A P(p \mid i)$, which subsequently gives the output of the network as an approximation
to the probability of the class given the pattern. This procedure of initialising a network
to give numbers which mimic posterior probabilities at the outputs assuming independence
will be used in the discussion of the medical data. Note that this interesting relationship
is a consequence of an appropriate coding of the input data.
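
The effect of this initialisation can be checked numerically. The sketch below starts from the hidden-node scalar products $z_i = \log P(p \mid i)$ of (16), applies the final form of (17), and then the output bias $p_i/A$ and weight $-p_i/A$. The function name, the value of A and the example numbers are assumptions made for illustration.

```python
import math

def independence_network_output(log_p_given_class, priors, A=1e-6):
    """Outputs of the initialised network for one input pattern.

    log_p_given_class : z_i = log P(p|i), the scalar product in (16)
    priors            : p_i, prior probability of class i
    A                 : small positive constant (1e-6 is an assumed value)
    """
    outputs = []
    for z, prior in zip(log_p_given_class, priors):
        f = 1.0 / (1.0 + A * math.exp(z))  # final form of equation (17)
        # Output node i: bias p_i / A plus weight -p_i / A on hidden node i.
        outputs.append(prior / A - (prior / A) * f)
    return outputs

likelihoods = [0.02, 0.5]  # hypothetical P(p|i) for two classes
priors = [0.3, 0.7]
out = independence_network_output([math.log(v) for v in likelihoods], priors)
```

For small A each output is close to $p_i P(p \mid i)$, here roughly 0.006 and 0.35. Making A very small eventually loses precision, because the output is formed as the difference of two large numbers, so in practice A trades approximation error against rounding error.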

There are additional interesting relationships between optimally trained networks and
Bayesian inference [22, 8, 15] in the limit of an infinite amount of data. In particular the
solution of the parameters which minimises the sum square error (3) using 1-from-c targets
has minimum variance from the optimal Bayes discriminant function, and making a decision
on the basis of the nearest target vector to an output is the same as picking the class with
largest output. This approximates the optimum Bayes solution for minimum error, i.e.
it maximises the likelihood of the class given the pattern. If the prototype target matrix
is the ‘equal cost’ matrix, so that it costs nothing to classify a pattern correctly (zero diagonal)
and always costs unity for an incorrect classification (unity off-diagonal), then picking the
nearest target vector to an output approximates the Bayes decision rule for minimum risk.
For an arbitrary loss matrix, such a useful relationship does not hold.
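
The equivalence between the nearest-target and largest-output decisions for 1-from-c targets follows in one line. Writing $e_k$ for the k-th 1-from-c target vector (a symbol introduced here only for this check) and $o$ for the network output vector:

```latex
\| o - e_k \|^2 = \| o \|^2 - 2\, o_k + 1
\quad\Longrightarrow\quad
\arg\min_k \| o - e_k \|^2 = \arg\max_k o_k
```

The terms $\| o \|^2 + 1$ do not depend on k, so minimising the distance to a target is the same as maximising the corresponding output component.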

The  point  is  made  that  appropriately  trained  networks  can  approximate  traditional 
statistical  inference.  How  this  is  achieved  depends  on  the  combination  of  input  and  output 
coding  and  error  weighting.  The  specific  choices  are  determined  by  prior  knowledge  of  the 
data,  and  the  effects  of  these  choices  have  been  outlined  above  in  terms  of  feature  extraction 
criteria.  We  now  illustrate  some  of  the  previous  results  by  considering  a  specific  real 
pattern  processing  problem  taken  from  the  medical  sciences. 

3 A Medical Prognosis Problem.

3.1  Discussion  of  the  data. 

The  problem2  is  to  attempt  to  predict  the  future  outcome  (prognosis)  of  patients  suffering 
from  severe  head  injury  on  the  basis  of  data  collected  shortly  after  injury.  Given  that 
patients  suffering  from  this  type  of  injury  tend  either  to  die  very  soon  after  injury,  or  not 
to  progress  significantly  after  a  period  of  the  order  of  six  months,  this  problem  may  be 
considered  as  a  static  pattern  classification  task,  i.e.  given  data  from  a  patient,  what  class 
of  recovery  is  he  likely  to  be  in  after  a  period  of  six  months.  For  the  purpose  of  this 
study,  only  three  classes  are  considered.  A  class  is  chosen  depending  upon  whether  after 
six  months  the  patient 

Class 1: is dead or vegetative
Class 2: has severe disability
Class 3: has moderate disability, or shows good recovery.


The  information  on  which  such  a  decision  is  based  comes  from  a  limited  amount  of  data 
which  may  be  obtained  from  a  coma  victim  such  as  the  patient’s  age,  pupils’  sensitivity  to 
light,  motor  response  in  all  four  limbs,  change  in  neurological  function  over  24  hours,  and 
eye  movements.  For  this  experiment,  the  following  six  feature  variables,  or  indicants  were 
used,  along  with  their  coding  schemes: 


Feature                         Coding Scheme

AGE (in decades)                1 ← 0-9, 2 ← 10-19, ..., 8 ← 70+
EMV (Glasgow Coma Sum)          0 ← missing, 1 ← 3, 2 ← 4, ..., 6 ← 8, 7 ← 9-15
MRP (Motor Response Pattern)    0 ← missing, 1 ← bad, ..., 7 ← good
CHANGE                          0 ← missing, 1 ← bad, ..., 3 ← good
EYE INDICANT                    0 ← missing, 1 ← bad, ..., 3 ← good
PUPILS                          0 ← missing, 1 ← nonreacting, 2 ← reacting

These  six  features  constitute  a  six  dimensional  feature  vector.  Each  component  is 
either  binary  or  ordered  and  thus  may  be  viewed  as  quantised  levels  of  continuous  variables, 
or  as  discrete  binary  values  where  the  precise  ordering  of  the  values  is  ignored.  Both  input 
coding  schemes  will  be  considered  subsequently.  For  the  purposes  of  our  experiments  the 
data  was  scaled  so  that  each  feature  was  in  the  range  [0, 1]  inclusive. 

The EMV score is a coding for the sum of three separate indicants: the Eye opening
response  to  stimulation  graded  1  to  4  (normal);  the  Motor  response  of  the  best  limb  to 
stimulation  graded  1  to  6  (normal);  and  the  Verbal  response  to  stimulation  graded  1  to  5 
(normal). 

The information constituting the data used in this experiment was collected prospectively
from 1000 patients who had been in coma for at least six hours. The data collection
study lasted over a period of 8 years beginning in 1968, with data obtained primarily by
clinicians at the Institute of Neurological Sciences, Glasgow, and also from two Netherlands
centres at Rotterdam and Groningen. Los Angeles subsequently provided additional data.
The clinicians involved in this data collection decided that the above indicants were suitable
to be recorded reliably by different clinicians in different circumstances in different
countries (see [18] and references therein for details). Nevertheless, it was still not always
possible to obtain a full set of test results on each patient. The occurrence of '0' in any
feature accounts for those instances when that particular piece of information is missing.
This is an added 'interesting feature' of this data set. We have performed experiments
where the missing data was treated as part of the feature vector, although they will not be
discussed here. The question of what to do with incomplete data sets is an interesting
statistical problem, and to what extent such an anomaly affects network performance has
still to be properly analysed. The 1000 feature vectors were randomly split into training
and test sets representing 500 patients in each group3. This division of the data produced
differences in the distribution of patterns in the classes between the training and test sets
as Table 1 shows.

2 See [18] for an excellent presentation of this problem and data.


Frequencies     Training Set    Test Set

Class 1             259            250
Class 2              52             48
Class 3             189            202

Table 1: Pattern distribution between classes in the training and test sets.


3.2 Problems with the data set.

This  is  a  particularly  interesting  data  set  to  work  with  for  a  variety  of  reasons,  most  of 
which  create  difficulties  for  many  techniques.  However,  the  problems  occurring  in  this 
data  set  reflect  trends  which  are  observed  in  many  distinct  real-world  pattern  processing 
tasks,  and  therefore  it  is  valid  to  emphasise  the  more  obvious  features. 


- Size: The first point to note is that it constitutes an unusually large data set in
relation  to  the  input  dimensionality  and  the  number  of  classes.  This  is  a  good 
aspect  of  this  data  set  and  particularly  rare  in  medical  applications.  Very  often  in 
applications  of  adaptive  networks  the  size  of  the  training  set  in  terms  of  the  number 
of  constraints  imposed  is  far  too  small  when  compared  to  the  dimensionality  of  the 
problem  reflected  in  the  number  of  adjustable  parameters  in  the  network.  This  leads 
to  an  undesirable  trend  of  a  network  overtraining  on  a  specific  data  set  at  the  expense 
of  good  generalisation  performance:  there  are  not  enough  constraints  to  estimate  the 
adjustable  parameters  reliably  and  the  problem  is  underspecified  leading  to  a  network 
solution  with  zero  error  on  the  training  set.  It  is  almost  always  possible  to  obtain 
zero  error  on  a  finite  training  set  by  increasing  the  complexity  of  the  network,  but 
this  is  not  interesting. 

- Features: The observations constituting the feature vector of an individual patient
are a combination of specific, well-categorised variables (e.g. AGE) and interdependent
subjective variables (e.g. CHANGE). An arbitrary coding scheme has been
introduced to quantify the degree of response for several observations. It is not clear
how critical this choice of coding is for classification. A significant amount of prior
knowledge and nonlinear feature extraction has already been performed on the raw
data to produce the feature vectors which are to be presented to a classifier, in a manner
which is almost certainly suboptimal. This is a common procedure for almost
all real pattern processing tasks, in that a significant amount of manual nonlinear
feature extraction is generally performed prior to automatic pattern discrimination.

3 We would like to thank Dr. Murray for supplying us with the same data, including the
division into train and test sets, as used in [18].

- Missing data: Previously, it was mentioned that in spite of attention to creating this
database,  occasionally  some  observations  were  not  made.  Thus,  several  vectors  have 
at  least  one  of  the  features  totally  absent.  This  is  a  major  aspect  of  this  data  since 
206  patterns  out  of  500  training  patterns  and  199  out  of  500  test  patterns  have  at  least 
one  observation  missing.  The  correct  treatment  of  missing  data  is  an  important 
area  of  study  by  itself  [17,  12].  Although  it  is  possible  to  initialise  networks  to 
accommodate  missing  data  by  working  with  the  class-conditional  distributions  and 
assuming  statistical  independence  (this  will  be  used  later),  a  trained  network  loses 
this  independence  assumption.  In  spite  of  the  importance  of  the  correct  procedure 
for  dealing  with  missing  data,  its  detailed  discussion  would  detract  from  the  main 
aims  of  this  paper.  Therefore,  for  simplicity  the  (suboptimal)  method  adopted  in 
the  experiments  is  to  replace  the  missing  data  in  the  training  set  by  values  selected 
randomly  according  to  the  distributions  as  determined  from  the  observed  components, 
and  in  the  test  set  by  the  means  estimated  from  the  training  set. 

- Uneven class distribution: Another major aspect of this data set is the fact that the
distribution of patterns between the classes is strongly nonuniform (Table 1). Since
least mean square error training of networks forces the hidden unit space to maximise
a feature selection criterion which employs a (squared) weighted between class covariance
matrix, the optimum network solution tends to ignore the least represented class
(class 2 in this case). This is a generic result which could lead to undesirable consequences
in problems where it was required that the classes were uniformly weighted.
In  addition,  there  is  a  slight  unevenness  between  the  train  and  test  set  distributions, 
but  this  would  not  be  expected  to  be  significant  in  this  problem.  However  it  is 
usually  assumed  in  network  training  schemes  that  the  distribution  of  patterns  in  the 
training  set  is  representative  of  the  distribution  of  the  patterns  in  the  test  set.  In 
problems  where  it  is  very  difficult  to  obtain  information  on  one  particular  class,  but 
it  is  very  easy  to  obtain  data  on  the  remaining  classes,  and  it  is  desirable  to  optimise 
the  correct  classification  of  the  difficult  class,  one  has  to  compensate  for  the  expected 
error  by  exploiting  knowledge  of  the  a  priori  probabilities. 

- Uneven class importance: This is related to the previous comment. Most medical
prognosis/diagnostic problems have an uneven importance attached to each class. It
is more important to diagnose serious ailments correctly than to diagnose psychosomatic
problems correctly. If the data set is loaded against diagnosing the serious
complaints  correctly  (by  not  having  sufficient  patterns  in  that  class  in  the  training  set) 
it  is  important  to  increase  the  priors  associated  with  that  class  artificially.  In  this 
particular  case,  it  is  most  important  to  obtain  a  correct  prognosis  on  those  patients 
in  class  2,  since  they  are  the  ones  requiring  long  term  rehabilitation.  Although  we 
can  assume  that  the  distribution  of  patterns  in  the  training  set  is  representative  of 
the  patterns  occurring  in  practice,  the  fact  that  class  2  has  the  least  membership 
implies  that  network  training  will  need  to  be  biased  by  exploiting  the  priors  in  order 
to  increase  the  recognition  of  patterns  in  that  class. 
- Cross-Class Penalties: In addition to class-conditional priors, there is often an
associated cross-class penalty index. There is a larger penalty associated with diagnosing
a serious ailment as psychosomatic, rather than the other way around. This
is an effect which cannot be compensated for by altering the priors, and an explicit
cost matrix has to be used to modify the target prototype vectors. For this particular
data set, a cost matrix has been devised on subjective grounds by neurosurgeons
involved with this study. This matrix reflects how serious the neurosurgeons would
judge the different misclassification errors. The matrix has the specific form (rows:
actual class; columns: predicted class)

$$
\begin{array}{cc}
 & \text{Predicted} \\
\text{Actual} & \begin{pmatrix} 0 & 10 & 75 \\ 10 & 0 & 90 \\ 750 & 100 & 0 \end{pmatrix}
\end{array}
\qquad (18)
$$



For  instance,  it  is  ten  times  more  serious  (clinically)  to  make  a  prognosis  that  a  patient 
who  recovers  (class  3)  is  predicted  to  die  (class  1)  rather  than  predict  incorrectly  that 
a  patient  who  would  die  should  recover.  Later  the  effects  of  exploiting  this  cost 
matrix  on  a  network’s  classification  performance  will  be  illustrated  by  modifying  the 
targets  as  outlined  in  the  previous  section. 

- Ambiguous data: Another aspect of this data set is that the three classes are extremely
overlapping: the three classes are not easily separable into distinct clusters.
Indeed, a few percent of the patterns occur more than once but with correspondingly
different target vectors (22 distinct vectors in the training set and 24 in the test set
belong to more than one distinct class). This implies ambiguous data leading to intrinsically
inconsistent training (technically, the actual mapping implied by the data
is not a function at all, and networks are only capable of functional interpolation).
The best performance one could expect in such circumstances is to reproduce the
likelihood of the class given the pattern (see section 2.2).

- Multi-centre: Finally, although the collection of this data represents an admirable
collaboration between establishments in different countries, the geographic, social and
cultural differences in the behaviour of the populations should lead to biases in the
data collected from each centre. The type and severity of head injury sustained is
likely to be different from country to country, for instance because of the different
regulations governing the wearing of crash helmets. As the data was pooled, these
trends have been dissipated in the data set, potentially leading to anomalies. Such
anomalies are likely to arise in other real pattern processing problems when data is
collected over a period of time, at different locations, by different people using different
experimental techniques or different monitoring equipment etc.
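The substitution scheme described under 'Missing data' can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes that 'the distributions as determined from the observed components' means the empirical distribution of the observed values of each feature, and that the test-set means are taken over the observed training values. The toy patterns and the function names are hypothetical.

```python
import random

MISSING = 0  # code used for a missing observation in the coding table


def observed_values(rows, k):
    """All non-missing values of feature k."""
    return [r[k] for r in rows if r[k] != MISSING]


def impute_training(rows, rng):
    """Replace each missing entry by a random draw from the observed
    values of the same feature (empirical distribution, an assumption)."""
    pools = [observed_values(rows, k) for k in range(len(rows[0]))]
    return [[v if v != MISSING else rng.choice(pools[k])
             for k, v in enumerate(r)] for r in rows]


def impute_test(test_rows, train_rows):
    """Replace each missing test entry by the mean of the observed
    training values of the same feature."""
    means = []
    for k in range(len(train_rows[0])):
        obs = observed_values(train_rows, k)
        means.append(sum(obs) / len(obs))
    return [[v if v != MISSING else means[k]
             for k, v in enumerate(r)] for r in test_rows]


# Hypothetical three-feature patterns, 0 marking a missing observation.
train = [[3, 0, 2], [5, 4, 0], [2, 6, 1]]
test = [[4, 0, 0]]

train_filled = impute_training(train, random.Random(0))
test_filled = impute_test(test, train)
```

Using random draws in training but means in testing keeps the test-set substitution deterministic, at the price of the suboptimality already noted in the text.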


These  points  illustrate  that  this  is  a  particularly  interesting  data  set,  not  because  it 
is  too  difficult  to  obtain  good  performance  by  network  techniques,  but  because  there  are 
several  aspects  in  the  data  set  which  are  likely  to  be  reflected  in  almost  any  real  pattern 
processing problem. The first rule of applying adaptive network techniques to a given
application is: before anything else, study and understand the limitations of the data. This
emphasises the need for baseline performance figures on a given data set, for instance by
other traditional pattern classifiers.

3.3 Standard Classification Results


As  with  all  other  techniques  of  statistical  pattern  recognition,  it  is  important  to  assess  the 
capabilities  of  adaptive  networks  against  a  baseline  performance  expectation.  In  this  paper 
the  expectation  is  provided  by  the  classification  abilities  of  a  range  of  standard  techniques 
applied to the same medical data set. The specific methods used are: Distance to
class mean (DCM), Gaussian classifier (GC), nearest neighbour (NN), K-nearest neighbour
(KNN), Optimum Linear Transformation (OLT) and a Statistical Independence (SI) model.
A brief discussion of each of these methods is presented in Appendix C. For a different
range of techniques applied to this data set see [18]. The statistical independence model
is the same as employed in [18] and has been included here to illustrate the reproducibility
of the results obtained in that paper. For instance, using the original data (unscaled and
unsubstituted), the best result obtained on the test set using the statistical independence
model  with  B  =  1  returned  75.2%  correctly  classified  patterns  (with  74.6%  on  the  training 
set).  This  is  the  same  as  reported  in  [18].  However,  the  best  result  on  the  training 
set  gave  worse  results  on  the  test  set.  Thus,  these  figures  are  biased  estimates  of  the 
statistical  independence  model  performance.  It  is  interesting  to  note  that  in  this  solution, 
the  statistical  independence  model  classified  correctly  only  3  patterns  out  of  48  in  class  2 
on  the  test  set  (compared  to  203  out  of  250  for  class  1  and  170  out  of  202  from  class  3). 
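As an indication of how simple the baseline techniques are, the following sketch implements a distance-to-class-mean classifier. The precise formulation used in the experiments is given in Appendix C; the version here assumes plain Euclidean distance, and the toy patterns and function names are hypothetical.

```python
def class_means(patterns, labels):
    """Mean feature vector of each class."""
    sums, counts = {}, {}
    for x, c in zip(patterns, labels):
        acc = sums.setdefault(c, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def classify(x, means):
    """Assign x to the class whose mean is nearest (squared Euclidean distance)."""
    def sqdist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(means, key=lambda c: sqdist(means[c]))


# Hypothetical two-feature patterns scaled to [0, 1].
patterns = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
labels = [1, 1, 3, 3]

means = class_means(patterns, labels)
near_class_1 = classify([0.3, 0.2], means)
near_class_3 = classify([0.7, 0.9], means)
```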

Figure 1 depicts the classification performance of each of the standard techniques on
the zero substituted data set. Note that the nearest neighbour classifier does not achieve
100% on the training set as would generally be expected. This is due to the inconsistent
nature of the data leading to unresolvable ties for correct classification. Interestingly
(and anomalously) the Optimum Linear Transformation performs well on this data. This
technique has the closest relationship to the network models, although networks may be
initialised according to the statistical independence model (as has already been discussed)
and to a nearest distance-to-class-mean classifier [2]. Also note that the statistical
independence model which gave the best test set results overall only ranked fourth on the
training set.
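The effect of inconsistent patterns on the best attainable training score can be illustrated with a short sketch. It counts how often each distinct feature vector occurs with each class label; any classifier that maps identical inputs to a single class can at best get the majority label right for each vector. The toy data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def class_counts_by_pattern(patterns, labels):
    """Map each distinct feature vector to a Counter of its class labels."""
    table = defaultdict(Counter)
    for x, c in zip(patterns, labels):
        table[tuple(x)][c] += 1
    return table


def ambiguous_patterns(table):
    """Feature vectors that occur with more than one class label."""
    return [x for x, counts in table.items() if len(counts) > 1]


def best_training_accuracy(table):
    """Upper bound on training accuracy for any deterministic classifier."""
    total = sum(sum(c.values()) for c in table.values())
    correct = sum(max(c.values()) for c in table.values())
    return correct / total


# Hypothetical patterns: the vector (1, 2) appears in classes 1 and 3.
patterns = [(1, 2), (1, 2), (1, 2), (3, 1), (2, 2)]
labels = [1, 1, 3, 3, 2]

table = class_counts_by_pattern(patterns, labels)
ambiguous = ambiguous_patterns(table)
bound = best_training_accuracy(table)
```

The per-vector label frequencies in `table` are also the empirical estimates of the probability of the class given the pattern, which is the best that can be reproduced for such data.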

Figure 1: Graph depicting the total number of correctly classified patterns on the training
set (open bars) and the test set (solid bars) for a range of classification techniques. The
Class 1 bars denote the performance obtained by classifying everything as class 1.

However, these performance figures do not convey the accuracy with which each class
was separately classified. This information may be observed in the confusion matrices of
each classifier. Figure 2 and Figure 3 show the confusion matrices produced on the training
and test sets by two of the best techniques: the optimum linear transformation model and
the K-nearest neighbour model. Note that the criterion for 'best' is simply the one which
gave the maximum number correct overall on the test set. This is not the criterion that
will be used to assess network performance. Neither is it a particularly useful criterion
for this medical data problem since it takes no account of costs. This is reflected in the
fact that the 'better' techniques achieve their superior performance at the expense of totally
misclassifying class two, as is evident by examining the confusion matrices. This same
trend will be observed in the naive application of network techniques where minimising
the residual error is achieved by ignoring the class least represented. Figure 4 shows the
confusion matrices as obtained by the statistical independence model on this data. Again,
there is a dominant trend to misclassify class 2 patterns at the expense of patterns from
classes 1 and 3. In each figure the 'average loss' is displayed below each confusion matrix.
This loss is obtained by multiplying the confusion matrix by the clinicians' loss matrix
element by element, summing and dividing by the total number of examples.
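The average loss computation just described is short enough to state as code. The sketch below assumes the confusion matrix uses the same orientation as the loss matrix (18), i.e. rows are actual classes and columns are predicted classes. The example confusion matrix is not one of the paper's figures: it is the matrix implied by assigning every test pattern to class 1, built from the test set counts in Table 1.

```python
# Clinicians' loss matrix (18): rows = actual class, columns = predicted class.
LOSS = [[0, 10, 75],
        [10, 0, 90],
        [750, 100, 0]]


def average_loss(confusion, loss=LOSS):
    """Element-by-element product of confusion and loss matrices,
    summed and divided by the total number of examples."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(confusion[i][j] * loss[i][j]
                   for i in range(len(loss))
                   for j in range(len(loss[i])))
    return weighted / total


def overall_accuracy(confusion):
    """Fraction of patterns on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Every test pattern classified as class 1 (test set counts from Table 1).
all_class_1 = [[250, 0, 0],
               [48, 0, 0],
               [202, 0, 0]]

loss_all_class_1 = average_loss(all_class_1)
accuracy_all_class_1 = overall_accuracy(all_class_1)
```

For this degenerate classifier half of the patterns are correct, yet the average loss is (48 x 10 + 202 x 750) / 500 = 303.96, which shows how weakly the number correct reflects the clinical costs.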

Figure 2: The confusion matrices of the Optimum Linear Transformation classifier.

Figure 3: The confusion matrices produced by the best (as determined on the test set!)
KNN model (using K = 18).


Figure 4: The confusion matrices produced by the best (as determined on the test set!)
Statistical Independence model (using B = 1).


3.4 Network Results.

The adaptive network used in the simulations has a standard feed forward architecture
with logistic nonlinearities at the hidden nodes but with linear transfer functions at the
output nodes. However the training method employed is not standard and consequently
warrants a brief discussion. The function evaluated by such a network is continuous and
differentiable.  Since  the  residual  error  to  be  minimised  is  analytic,  any  standard  nonlinear 
least  mean  squares  optimisation  technique  which  employs  knowledge  of  a  function  and  its 
first  derivative  may  be  used.  Since  the  total  number  of  weights  in  a  network  for  this 
medical  problem  is  likely  to  be  small  (<  100)  we  have  previously  found  [21,  20]  that  a 
quasi-Newton  method  is  an  efficient  procedure  for  obtaining  a  local  minimum  of  the  error 
function.  In  addition  it  is  natural  to  consider  the  various  layers  of  a  feed  forward  network 
as  evolving  on  different  time  scales:  the  final  layer  weights  adapting  rapidly  to  the  varying 
first  layer  weights.  In  this  sense  the  final  layer  weights  are  slaved  to  the  behaviour  of  the 
first  layer  weights.  Since  the  output  nodes  are  linear,  given  the  set  of  patterns  at  the 
output  of  the  hidden  layer  the  final  layer  weights  may  be  obtained  instantaneously  (on  the 
time  scale  of  the  previous  layer)  by  a  pseudo-inverse  method.  Thus,  the  training  method 
adopted  for  this  network  is  as  follows.  The  network  is  initialised  with  random  weights 
in  an  appropriate  range.  The  first  layer  weights  are  incrementally  varied  in  a  direction 
determined  by  the  Broyden-Fletcher-Goldfarb-Shanno  (BFGS)  quasi-Newton  nonlinear 
optimisation  scheme  [6].  As  the  (slow)  search  for  a  local  minimum  continues  in  the  weight 
space  spanned  by  the  first  layer  weights,  the  final  layer  weights  are  adapted  instantly  to  any 
change  so  as  to  always  remain  in  a  (global)  minimum  in  the  weight  space  spanned  by  the 
final  layer  weights.  This  ensures  that  the  network  as  a  whole  is  guaranteed  to  be  in  a  local 
minimum  once  the  first  layer  weights  have  found  a  locally  optimum  position.  We  have 
already  seen  that  this  optimum  position  corresponds  to  a  nonlinear  transformation  which 
maximises  a  specific  feature  selection  criterion.  The  multiple  time  scale  description  implies 
that  feature  selection  emerges  slowly  compared  to  the  (linear)  automatic  classification  part 
of  the  network. 
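The two time scale training scheme can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the code used for the experiments: it keeps the final layer at its least-squares optimum through a pseudo-inverse, as described above, but for brevity the first layer search uses plain gradient descent with step halving and a finite-difference gradient in place of the BFGS quasi-Newton scheme. The toy data, the initial weight range and all function names are hypothetical.

```python
import numpy as np


def hidden_outputs(W1, X):
    """Logistic hidden units; a constant column is appended as the output bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    H = 1.0 / (1.0 + np.exp(-Xb @ W1))
    return np.hstack([H, np.ones((H.shape[0], 1))])


def solve_final_layer(H, T):
    """Least-squares final layer weights via the pseudo-inverse."""
    return np.linalg.pinv(H) @ T


def error(W1, X, T):
    """Sum-square error with the final layer slaved to its optimum."""
    H = hidden_outputs(W1, X)
    W2 = solve_final_layer(H, T)
    return float(np.sum((H @ W2 - T) ** 2))


def numerical_gradient(W1, X, T, eps=1e-6):
    """Central-difference gradient of the error w.r.t. first layer weights."""
    g = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        step = np.zeros_like(W1)
        step[idx] = eps
        g[idx] = (error(W1 + step, X, T) - error(W1 - step, X, T)) / (2 * eps)
    return g


def train(X, T, n_hidden=3, iters=50, seed=0):
    """Slow search over first layer weights (gradient descent stand-in for BFGS)."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(X.shape[1] + 1, n_hidden))
    e0 = e = error(W1, X, T)
    lr = 0.5
    for _ in range(iters):
        trial = W1 - lr * numerical_gradient(W1, X, T)
        e_trial = error(trial, X, T)
        if e_trial < e:          # accept only steps that reduce the error
            W1, e = trial, e_trial
        else:                    # otherwise shrink the step
            lr *= 0.5
    return W1, e0, e


# Hypothetical toy problem with 1-from-c targets for two classes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

W1, initial_error, final_error = train(X, T)
```

Because `error` re-solves the final layer on every call, any optimiser applied to the first layer weights alone sees an error surface on which the final layer is already optimal. A quasi-Newton routine such as `scipy.optimize.minimize` with `method="BFGS"` could be applied to the flattened first layer weights in place of the loop in `train`.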

The choice of network complexity also warrants a brief discussion. As the number
of hidden nodes increases, it is possible to fit the training data more and more accurately.
Beyond a certain number, the noise on the data as well as the hidden structure of the data is
also being fitted. This is overtraining and the generalisation error will begin to fluctuate.
For this problem, it was decided (on the criterion of minimum generalisation error) that
three hidden units were sufficient to model the structure in the data adequately. The
specific choice is not crucial for the purposes of this paper, as the issue we are concerned
with is: given a network, what is the effect on the feature extraction mechanism and
subsequent classification of changing the priors and the targets? Thus, we choose to fix
the number of hidden units for each set of experiments to allow comparisons to be made
across experiments.

3.4.1 The 'standard' result.

The standard experiment consists of using 1-from-c target coding and no extra priors
($d_p = 1 \;\forall p$). From 100 different random weight starts, the classification results of the
trained network with smallest error on the training set are presented in Figure 5. Note
that, as predicted, the network has achieved the minimum error solution by totally ignoring
the class with smallest membership. The figures also present the average loss associated
with the experiment. The significance of the average loss will be apparent only when we
incorporate the cost matrix as part of the training phase of the networks.

Figure 5: The confusion matrices produced by the best (as determined by the smallest
training error) 'standard' network (6-3-3).


3.4.2  1-from-c  targets,  equal  priors 

In this experiment, the prototype target matrix is the identity matrix, but the priors are
adjusted to be equal on the test set (thus $d_p = P(i)/P_i \propto 1/P_i$ if the p-th pattern is in the
i-th class). Strictly speaking, this compensation is not necessary for this data set since
the distribution of patterns between the classes is approximately the same between the train
and the test sets. Nevertheless it does reveal how the relative importance of classifying
class 2 correctly can be increased. Figure 6 shows the misclassification matrices obtained
from the network solution with smallest training error. It is noted that the network is
recognising correctly several patterns from class 2, although there is a high misclassification
rate from class 1 into class 2. The training error is slightly worse and the average loss is
slightly worse than the standard case.
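The pattern weights for this experiment follow directly from the class frequencies. The sketch below computes them as the ratio of the desired (test) prior to the training set prior for the class of each pattern, using the training counts of Table 1. The function name is hypothetical, and the normalisation (desired priors summing to one) is an assumption, since only the proportionality to 1/P_i matters.

```python
# Training set class counts from Table 1.
train_counts = {1: 259, 2: 52, 3: 189}


def pattern_weights(counts, desired_priors=None):
    """Weight for a pattern of class i: desired prior / training prior.

    With desired_priors=None the desired priors are taken to be equal,
    so the weight is proportional to 1 / P_i.
    """
    n = sum(counts.values())
    classes = sorted(counts)
    if desired_priors is None:
        desired_priors = {c: 1.0 / len(classes) for c in classes}
    return {c: desired_priors[c] / (counts[c] / n) for c in classes}


equal_prior_weights = pattern_weights(train_counts)

# 'No extra priors': the desired priors equal the training priors,
# so every pattern receives unit weight.
n_train = sum(train_counts.values())
no_prior_weights = pattern_weights(
    train_counts, {c: k / n_train for c, k in train_counts.items()})
```

With equal desired priors the class 2 patterns receive a weight of roughly 3.2, against roughly 0.64 for class 1 and 0.88 for class 3, so that each class contributes the same total weight (500/3) to the training error.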


3.4.3  Targets  taken  from  the  loss  matrix,  no  priors. 

In this experiment, the prototype target matrix is the loss matrix subjectively obtained
by the group of clinicians involved in the study. The priors are not compensated for
(thus $d_p$ = (priors on the test set)/(priors on the training set) $= 1 \;\forall p$). Figure 7 shows
the misclassification matrices obtained from the network solution with smallest training
error. Now, as well as raising the recognition performance of the minority class, the
average loss is significantly reduced. This has been at the expense of overall recognition
performance. However it does indicate the effects of minimising a cost function which
incorporates misclassification costs.

Figure 6: The confusion matrices produced by the best (as determined by the smallest
training error) network with equalising the priors.

Figure 7: The confusion matrices produced by the best (as determined by the smallest
training error) network incorporating the misclassification costs in the training process.


3.4.4  Targets  taken  from  the  loss  matrix,  equal  priors. 

In this experiment, again the prototype target matrix is the loss matrix. In addition the
priors are equalised on the test set (thus $d_p \propto 1/P_i$). Figure 8 shows the misclassification
matrices for this experiment (choosing the smallest training error over 100 separate trials).
This network has a predominant tendency to misclassify class 1 patterns as class 2. This
is an inappropriately trained network since equalising the priors and incorporating
misclassification costs into the training unfairly biases against the high cost class, since the test
and train distributions are approximately equal. However, Figure 8 does make the point
that class 3 patterns are not very likely to be classified incorrectly as class 1 (i.e. it is
important not to make the prognosis that a patient who will recover is likely to die,
particularly if this decision affects the operation of a life-support system on that patient).
Alternatively, it is comparatively easy to misclassify patterns from class 1 as from class 3
(it is not too serious to predict a patient will recover if he actually dies, as far as the health
of the patient is concerned).


3.4.5  Class-weighted  targets,  no  priors 

In  this  experiment,  the  target  vector  for  a  training  pattern  is  1  \'7\  if  the  pattern  is  in 
class  i,  and  zero  otherwise.  We  have  seen  previously  that  this  forces  the  feature  selection 
criterion to employ the conventional between class covariance matrix which does not bias
so heavily in favour of the most represented class. The decision regions are still influenced
by the numbers in each class though. It is seen in Figure 9 that a small proportion of
class 2 are now recognised correctly but the minimum error solution is still predominantly
to misclassify class 1 as class 3 and class 3 as class 1.

Figure 8: The confusion matrices produced by the best (as determined by the smallest
training error) network incorporating the misclassification costs and compensating for the
priors.

Figure 9: The confusion matrices produced by the best (as determined by the smallest
training error) network with class-weighted targets, not compensating for the priors.


3.5  Network  results  using  binary  input  coding. 


The previous results have employed a coding scheme for the input patterns which assumes
quantised, ordered features leading to a six dimensional continuous input. An obvious
alternative input coding scheme is to assume that the value of each variable is independent,
leading to a 30 dimensional binary input pattern for this medical data. An advantage
of this input coding scheme for networks is that the structure may be mapped onto the
statistical independence model as discussed in section 2.2. This allows a network to
be initialised with a 'sensible' weight configuration prior to optimisation. The following
results present the various confusion matrices using the binary input coding and the various
combinations of error weighting and cost matrices.
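The 30 dimensional binary coding can be sketched as a 1-from-n block per feature, with block sizes taken from the number of non-missing levels in the coding table of section 3.1 (8 + 7 + 7 + 3 + 3 + 2 = 30). Mapping a missing value (code 0) to an all-zero block is an assumption made here for illustration; in the experiments missing values were substituted beforehand. The example patient vector and the function name are hypothetical.

```python
# Number of non-missing levels per feature, from the coding table.
LEVELS = [("AGE", 8), ("EMV", 7), ("MRP", 7),
          ("CHANGE", 3), ("EYE", 3), ("PUPILS", 2)]


def binary_code(pattern):
    """Expand six integer feature codes into a 30 dimensional binary vector."""
    bits = []
    for (name, n_levels), value in zip(LEVELS, pattern):
        block = [0] * n_levels
        if value != 0:               # 0 = missing: leave the block empty (assumption)
            block[value - 1] = 1     # level k switches on the k-th unit of the block
        bits.extend(block)
    return bits


# Hypothetical patient: AGE=3, EMV=7, MRP=5, CHANGE=2, EYE missing, PUPILS=2.
patient = [3, 7, 5, 2, 0, 2]
coded = binary_code(patient)
```

Each input unit then corresponds to one value of one indicant, which is the structure needed to map the first layer of weights onto the statistical independence model.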

3.5.1 The 'standard' result.


Employing 1-from-c targets, $d_p = 1 \;\forall p$ and a 30-3-3 network, the best (i.e. the one
returning the smallest residual training error) experiment from 100 different initial random
weight starts gave the confusion matrix in Figure 10. Generally, the performance is better
than the corresponding result using continuous input coding. Specifically, using binary
input coding gave more correct on training, the normalised training error was slightly less,
the average loss was less and it also succeeded in correctly classifying many of the class 2
patterns. However the average loss on test is significantly higher.
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Figure  10:  The  confusion  matrices  produced  by  the  best  (as  determined  by  the  smallest 
training  error)  network  on  the  standard  network  using  binary  input  coding. 


3.5.2  The  ‘standard’  network  initialised  as  a  statistical  independence  model. 


This  experiment  used  the  same  input  and  output  coding  as  the  previous  case.  In  this 
example,  the  network  weights  were  initially  configured  to  reproduce  the  results  as  obtained 
by  a  statistical  independence  model.  This  does  not  correspond  to  a  minimum  of  the 
network’s  residual  error  and  so  when  optimisation  proceeds,  the  weights  adapt  away  from 
the  preset  values  to  find  a  smaller  error  configuration.  The  confusion  matrices  are  depicted 
in Figure 11. This configured network does not achieve as low
a training or test error as the best of the random configurations presented in Figure 10,
but it does return slightly better classification performance figures. The differences
are so small, however, that one may take Figure 11 as an indication that it is useful to exploit
prior knowledge to initialise a network in a particular configuration prior to the optimisation
scheme. This is particularly so if extensive and time-consuming optimisation experiments
would otherwise be needed to find a suitable initial random start configuration for the weights.
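
To illustrate why binary inputs make such an initialisation possible, the sketch below shows that the log of the independence-model score is a linear function of the binary input, so it can be written as one layer of weights and biases. This is only an assumed, generic construction; the exact mapping onto the 30-3-3 network is the one described in section 2.2 and may differ. The data, the level counts and the add-one smoothing are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
LEVELS = [5] * 6                       # hypothetical levels per feature (total 30)
N_CLASSES = 3
offsets = np.cumsum([0] + LEVELS[:-1]) # start of each feature's block of units

# Synthetic stand-in data: level indices per feature and class labels.
X_idx = rng.integers(0, 5, size=(200, 6))
y = rng.integers(0, N_CLASSES, size=200)

def one_hot(rows):
    out = np.zeros((len(rows), sum(LEVELS)))
    for f, off in enumerate(offsets):
        out[np.arange(len(rows)), off + rows[:, f]] = 1.0
    return out

X = one_hot(X_idx)

# Class priors and per-class marginal level frequencies
# (add-one smoothing is an assumption, used to avoid log(0)).
priors = np.array([(y == c).mean() for c in range(N_CLASSES)])
W = np.zeros((N_CLASSES, X.shape[1]))
for c in range(N_CLASSES):
    counts = X[y == c].sum(axis=0) + 1.0
    for off, n in zip(offsets, LEVELS):
        W[c, off:off + n] = np.log(counts[off:off + n] / counts[off:off + n].sum())
b = np.log(priors)

# Linear scores versus the explicit product of marginals times the prior.
scores = X @ W.T + b
direct = np.array([[priors[c] * np.prod(np.exp(W[c])[x.astype(bool)])
                    for c in range(N_CLASSES)] for x in X])
agree = np.array_equal(scores.argmax(axis=1), direct.argmax(axis=1))
print(agree)
```

Because the linear scores are the logarithms of the independence-model scores, both give the same class ranking for every pattern.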


3.5.3  1-from-c  targets,  equal  priors 


The  prototype  target  matrix  is  the  identity  matrix,  and  the  priors  are  adjusted  to  be  equal 
on  the  test  set.  Comparing  the  results  in  Figure  12  with  those  of  the  corresponding 
continuous  case  (see  Figure  6),  the  training  performance  is  better  in  this  experiment  but 
the  generalisation  is  worse. 

Figure 11: The confusion matrices produced by the standard network initialised as a
statistical independence model.

Figure 12: The confusion matrices produced by the network with binary inputs, the
identity as prototype target vectors and the priors adjusted to equalise expectation on the
test set.


3.5.4  Targets  taken  from  the  loss  matrix,  no  priors. 


The priors are not compensated for, and the prototype target vectors are taken from the
clinicians’ cost matrix. The confusion matrices of the best experiment out of 100 random
starts are depicted in Figure 13, which should be compared with Figure 7. Because the
costs are being incorporated into the training phase there is a tendency to misclassify class
1 patterns significantly. This gives a small overall loss on the training and test sets, but
also returns poor overall classification results. These results are not as good as those using
continuous features.
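
The trade-off between overall loss and classification accuracy can be made concrete with a small sketch. It assumes that the average loss is the mean of the loss-matrix entries weighted by the confusion-matrix counts. The loss matrix and both confusion matrices below are made-up numbers, not the clinicians’ costs or the results in the figures.

```python
import numpy as np

# Made-up loss matrix (rows: true class, columns: predicted class),
# with zero cost on the diagonal and the largest costs for errors on class 2.
loss = np.array([[0.0, 10.0, 10.0],
                 [100.0, 0.0, 100.0],
                 [10.0, 10.0, 0.0]])

# Two made-up confusion matrices over the same 200 patterns.
conf_a = np.array([[80, 0, 20],
                   [15, 0, 15],
                   [20, 0, 50]])    # never predicts class 2
conf_b = np.array([[40, 50, 10],
                   [2, 26, 2],
                   [10, 40, 20]])   # sacrifices class 1 to recover class 2

def summarise(conf, loss):
    """Return (fraction correct, average loss per pattern)."""
    total = conf.sum()
    accuracy = np.trace(conf) / total
    average_loss = (conf * loss).sum() / total
    return accuracy, average_loss

acc_a, loss_a = summarise(conf_a, loss)
acc_b, loss_b = summarise(conf_b, loss)
print(acc_a, loss_a)
print(acc_b, loss_b)
```

In this toy case the second classifier has fewer correct decisions but a smaller average loss, which is the qualitative behaviour described above.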


3.5.5  Targets  taken  from  the  loss  matrix,  equal  priors. 


The priors are compensated for so that the test set expectation is equal, and the prototype
target vectors are taken from the clinicians’ cost matrix. The confusion matrices of the
best experiment out of 100 random starts are depicted in Figure 14. The corresponding
results for the continuous input coding scheme are shown in Figure 8. The effect on
class 1 classification of equalising the priors is even more drastic: all class 1 patterns
are incorrectly classified in both the training and test sets. However, the overall loss is
notably small in both instances.

Figure 13: The confusion matrices produced by the network with binary inputs using the
clinicians’ cost matrix as prototype target vectors and not compensating for the priors.

Figure 14: The confusion matrices produced by the network with binary inputs using the
clinicians’ cost matrix as prototype target vectors as well as equalising for the priors on the
test set.


3.5.6  Class-weighted  targets,  no  priors 

The target value for a training pattern is $1/\sqrt{n_i}$ at output $i$ if the pattern is in
class $i$, and zero otherwise, where $n_i$ is the number of training patterns in class $i$.
The priors are not compensated for. Since the costs are not incorporated into
the training, this experiment misclassifies class 2 to achieve a minimum error, as observed in
the results of Figure 15. These results are slightly better than those in the corresponding
experiment using an ordered input coding (Figure 9).


3.5.7  Targets  taken  from  the  scaled  loss  matrix,  no  priors. 


It is interesting to consider the effect that scaling the cost matrix has on training
performance. In principle the absolute values of the elements constituting the cost matrix do
not affect the network’s performance; only the relative values are important.
However, in practice the absolute values of the targets influence the iterative optimisation
scheme through the step lengths taken. The results in Figure 16 depict an experiment
where the priors have not been compensated for and the target coding is taken from a
scaled version of the clinicians’ cost matrix in which each loss is scaled down by a factor of
100. The training error in this case is significantly less than in
the corresponding case working with the cost matrix directly. In addition, it took many
more iterations to reach the minimum in the scaled case, indicating that the minimum
found without scaling the cost matrix was a poor local minimum. The effect of the
better minimum is to find a solution with a much smaller average cost (10.74 compared to
20.61) and a much better overall classification performance (372 correct on training and 315
correct on test compared to 282 and 269 correct respectively in the unscaled case). The
normalised test error was slightly worse than in the unscaled case.

Figure 15: The confusion matrices produced by the network with binary inputs. The
prototype target vectors are class weighted by the square root of the numbers in that class
and the priors are not compensated for.

Figure 16: The confusion matrices produced by the network with binary inputs using the
clinicians’ cost matrix scaled down by a factor of 100 as prototype target vectors. The
priors were not compensated for.
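
The statement that only the relative values of the targets matter in principle can be checked for the linear output layer. The sketch below is a minimal illustration under stated assumptions: the hidden-layer outputs are held fixed, the output weights are found by pseudo-inverse least squares, and the hidden outputs, labels and cost-like prototype targets are all made up. It does not model the iterative optimisation of the hidden layer, which is where the step-length effect described above arises.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, n_patterns, n_classes = 3, 60, 3

# Made-up hidden-layer outputs, class labels and prototype targets
# (column k of `prototypes` is the target vector for class k).
H = rng.normal(size=(n_hidden, n_patterns))
labels = rng.integers(0, n_classes, size=n_patterns)
prototypes = np.array([[0.0, 10.0, 10.0],
                       [100.0, 0.0, 100.0],
                       [10.0, 10.0, 0.0]])

def fit_and_classify(protos):
    T = protos[:, labels]                          # c x P target matrix
    H1 = np.vstack([H, np.ones(n_patterns)])       # append a bias row
    A = T @ np.linalg.pinv(H1)                     # least-squares output weights
    O = A @ H1                                     # network outputs, c x P
    # Squared distance from each output to each prototype target.
    d = ((O[:, None, :] - protos[:, :, None]) ** 2).sum(axis=0)
    return O, d.argmin(axis=0)                     # pick-the-closest decision

O_full, pred_full = fit_and_classify(prototypes)
O_scaled, pred_scaled = fit_and_classify(prototypes / 100.0)
print(np.allclose(O_scaled * 100.0, O_full))
```

With the hidden layer fixed, scaling the targets simply scales the outputs, so the nearest-target decisions are unchanged.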
4 Discussion

The  intention  of  this  paper  has  been  to  expand  and  explore  relationships  between  adaptive 
networks  trained  ‘optimally’  and  feature  extraction  criteria,  and  the  relevance  of  these  links 
in  the  context  of  a  real-world  classification  problem  (medical  prognosis).  In  particular  it 
was  stressed  that  networks  perform  feature  extraction  and  classification  simultaneously,  and 
the  specific  feature  extraction  criterion  depends  on  the  classification.  In  addition,  it  was 
necessary to emphasise the significance of input/output coding schemes for network training
and what this implied for the interpretation of actual network outputs in a probabilistic
context. This led to a novel scheme where, under appropriate circumstances, the set of
network parameters may be initialised so that the network performs the same task as a
standard statistical independence model. Subsequent optimisation adapted these parameters
to  reduce  the  error,  but  at  the  expense  of  classification  for  the  medical  prognosis  problem 
considered.  Numerical  comparisons  were  made  between  network  results  and  a  range  of 
standard  pattern  classification  techniques  on  the  medical  data.  Finally,  a  comparison  of 
results  obtained  by  a  fixed  network  structure  but  utilising  various  combinations  of  coding 
schemes  and  error  weightings  was  presented. 

The main theme of the paper has been to discuss that group of classification problems
where  the  relative  importance  of  the  classes  follows  a  different  distribution  to  their  prior 
occurrence  in  the  training  set.  In  this  general  theme,  two  connected  issues  were  discussed. 
In  one  case,  the  distribution  of  patterns  amongst  the  various  classes  may  be  the  same 
between  the  training  and  the  test  sets,  but  there  may  be  a  nonuniform  penalty  associated 
with  misclassification.  This  is  particularly  so  for  the  medical  data  used,  since  it  is  more 
important  to  try  and  diagnose  that  class  of  patient  who  is  likely  to  survive  but  will  require 
long  term  care.  With  prior  knowledge  of  the  loss  matrix,  the  distribution  of  patterns  in 
recognition  mode  may  be  varied  by  incorporating  the  loss  matrix  in  the  training  process 
itself.  This  redistribution  of  patterns  is  produced  by  a  feature  extraction  transformation 
which  attempts  to  allow  a  decision  to  be  made  which  relates  to  minimising  the  overall 
expected  loss.  This  redistribution  is  evident  in  the  confusion  matrices.  The  anomaly 
with  the  second  group  of  problems  is  that  the  expected  occurrence  of  classes  in  the  test  set 
may  be  different  to  the  distribution  of  patterns  between  the  classes  in  the  available  training 
set.  Since  minimising  the  uniform  error  maximises  a  feature  extraction  criterion  which 
weights  in  favour  of  those  classes  most  represented,  the  error  has  to  be  weighted  according 
to  the  expected  probability  distribution  of  the  classes  in  operation.  Using  appropriate 
error weighting factors, minimising the error distorts the effective between-class covariance
matrix of the nonlinearly transformed patterns so as to produce classification results which
reflect  the  expected  class  distribution. 
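
As a rough illustration of error weighting, the sketch below uses per-pattern weightings equal to the ratio of the expected class proportion to the training-set class frequency. This ratio is an assumed, commonly used choice, not necessarily the formula derived earlier in the paper, and the class counts are made up.

```python
import numpy as np

# Made-up, uneven training set: 70, 10 and 20 patterns in classes 0, 1, 2.
labels = np.array([0] * 70 + [1] * 10 + [2] * 20)

# Expected class proportions in operation (here: equal priors).
desired = np.array([1 / 3, 1 / 3, 1 / 3])

# Training-set class frequencies and one error weighting d_p per pattern.
train_freq = np.bincount(labels, minlength=3) / labels.size
d = (desired / train_freq)[labels]

# Share of the total error weighting carried by each class.
share = np.bincount(labels, weights=d, minlength=3) / d.sum()
print(share)
```

With this choice each class carries a share of the weighted error equal to its expected proportion, whatever its membership in the training set.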

For  certain  choices  of  coding  scheme  it  was  shown  that  the  optimum  network  solution 
attempted  to  reproduce  the  Bayes  decision  vector  and  choosing-the-nearest  decision  rule 
corresponded  to  maximising  the  likelihood,  or  minimising  the  expected  risk  in  a  Bayesian 
sense.  When  one  is  not  in  this  optimum  situation,  the  outputs  of  the  model  network  do 
not  reflect  probabilities  (although  they  may  still  be  employed  as  calculational  devices  in 
any  subsequent  decision-making  process)  and  it  is  not  obvious  what  is  the  best  decision 
rule to apply. For consistency with the least-mean-square approach, we have chosen to
always use the pick-the-closest decision rule. The precise interpretation of the outputs
of  an  approximate  network  model  and  how  these  outputs  should  be  employed  in  a  decision 
rule  is  an  area  of  research  to  be  continued. 

A discussion of the peculiar network optimisation strategy that is preferred in our
simulations was given. This introduces the concept of the different dynamics of the various
weights, depending on the layer. Specifically, the weights performing the feature extraction
transformation evolve on a slow time scale compared to the speed of adaptation of the
final layer weights performing the classification.

Having  established  a  sound  framework  for  the  network  methodology,  the  techniques 
were  applied  to  a  specific  real  world  pattern  processing  problem:  medical  prognosis  of 
head-injured  coma  patients.  This  is  a  particularly  useful  data  set  with  many  problems 
and  advantages  as  discussed  in  the  text.  In  particular,  a  standard  network  trained  on 
this data minimised the overall error by misclassifying most of the patterns in class 2, in
common  with  many  other  standard  pattern  recognition  techniques.  It  was  numerically 
demonstrated  that  the  distribution  of  predicted  classes  in  the  test  set  may  be  manipulated 
by  an  appropriate  choice  of  target  coding  and  error  weighting  (although  for  this  data  set 
the  distribution  of  patterns  between  classes  is  approximately  uniform  between  the  train 
and  test  sets).  The  numerical  results  corroborate  the  theoretical  expectations  given  in  the 
earlier  sections,  in  that  incorporating  the  loss  matrix  allows  a  solution  to  be  obtained  with 
a  smaller  overall  risk  by  forcing  a  redistribution  of  elements  in  the  confusion  matrix.  This 
smaller  risk  is  achieved  generally  at  the  expense  of  worse  overall  classification  accuracy. 

It was not our intention to find the best possible solution for medical prognosis of
head-injured patients. In fact, among the range of classifiers that we have considered, the
technique with the best classification performance remains the standard statistical
independence model, although the optimum linear transformation gave better performance in
terms of the overall loss. We have not examined different types of adaptive networks or
methods to determine the optimum network topology for this particular data set. Neither
have we managed to solve the problem of finding the strategy for treating missing data which
is optimally matched to the network model. All these must be considered to be open problems.
A The Generalised Network Feature Extraction Criterion


This  appendix  shows  how  minimising  the  error  at  the  output  of  a  network  with  linear  output 
units  is  equivalent  to  maximising  a  specific  feature  extraction  criterion  at  the  outputs  of 
the  final  hidden  layer. 

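The error to be minimised is

$$
E = \frac{1}{P} \sum_{p=1}^{P} d_p \left\| \mathbf{t}_p - \mathbf{o}_p \right\|^2
  = \frac{1}{P} \left\| \left[ T - \left( \boldsymbol{\lambda}_0 \mathbf{1}' + \Lambda H \right) \right] D \right\|^2
\tag{19}
$$

where $T$ is the $c \times P$ matrix of target patterns, $\boldsymbol{\lambda}_0$ is the $c \times 1$ vector of
output biases, $\Lambda$ is the $c \times n_0$ matrix of weights between the $n_0$ hidden nodes and the
$c$ output nodes, $H$ is the $n_0 \times P$ matrix of output patterns at the hidden layer, and $D$ is
the $P \times P$ diagonal matrix of error weightings, $\sqrt{d_p}$.

Minimising (19) with respect to the bias vector gives the optimum solution for the biases
as

$$
\boldsymbol{\lambda}_0 = \bar{\mathbf{t}} - \Lambda \mathbf{m}^H
\tag{20}
$$

where $\bar{\mathbf{t}}$ is the weighted target mean:

$$
\bar{\mathbf{t}} = \frac{T D^2 \mathbf{1}}{\mathbf{1}' D^2 \mathbf{1}}
$$

and $\mathbf{m}^H$ is the weighted mean pattern evaluated on the training set at the outputs of the
hidden nodes:

$$
\mathbf{m}^H = \frac{H D^2 \mathbf{1}}{\mathbf{1}' D^2 \mathbf{1}}
$$

In these equations, $\mathbf{1}$ is a $(P \times 1)$ vector of 1's. Thus, the error to be minimised may be
expressed as

$$
E = \frac{1}{P} \left\| \left[ \tilde{T} - \Lambda \tilde{H} \right] D \right\|^2
\tag{21}
$$

where $\tilde{T} = T - \bar{\mathbf{t}} \mathbf{1}'$ is the mean-shifted target matrix and
$\tilde{H} = H - \mathbf{m}^H \mathbf{1}'$ is the matrix of mean-shifted hidden unit output patterns.
The weight matrix $\Lambda$ which minimises (21) with minimum norm is

$$
\Lambda = \tilde{T} D \left( \tilde{H} D \right)^+
\tag{22}
$$

where $A^+$ is the Moore-Penrose pseudo-inverse of a matrix $A$. Using (22) in (21) and
exploiting the properties of the pseudo-inverse gives the error in the form

$$
E = \frac{1}{P} \mathrm{Tr}\left\{ \tilde{T} D^2 \tilde{T}'
  - \tilde{T} D^2 \tilde{H}' \left( \tilde{H} D^2 \tilde{H}' \right)^+ \tilde{H} D^2 \tilde{T}' \right\}
\tag{23}
$$

where Tr is the trace operation. Since the targets and error weightings are fixed, minimising
the error is equivalent to maximising the function

$$
J = \mathrm{Tr}\left\{ S_B S_T^+ \right\}
\tag{24}
$$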
where the matrices $S_T$ and $S_B$ are specified entirely in terms of the targets and the
distribution of patterns at the outputs of the hidden nodes:

$$
S_T = \frac{1}{P} \tilde{H} D^2 \tilde{H}', \qquad
S_B = \frac{1}{P^2} \tilde{H} D^2 \tilde{T}' \tilde{T} D^2 \tilde{H}'
$$

These matrices may be interpreted as weighted total and between-class covariance
matrices of the nonlinearly transformed input patterns. This allows (24) to be interpreted
as a feature extraction criterion for discriminant analysis, where the criterion is optimum
for the subsequent linear classification scheme to the output targets. The specific feature
extraction criterion above is similar and has an equivalent interpretation to other suggested
cost functions used in traditional discriminant analysis [10, 7]. The difference is that
this particular feature extraction has been optimised to achieve the best performance
consistent with a subsequent linear classification of the patterns. Thus, such networks are
performing optimised feature extraction and classification simultaneously.
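
The equivalence between the residual error and the criterion can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the error in (19) with a 1/P normalisation, uses random stand-in hidden outputs and error weightings rather than a trained network, and compares the error evaluated directly at the optimum output weights with the trace form built from the weighted total and between-class matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
c, n0, P = 3, 4, 50

T = np.eye(c)[:, rng.integers(0, c, size=P)]   # 1-from-c targets, c x P
H = np.tanh(rng.normal(size=(n0, P)))          # made-up hidden outputs, n0 x P
d = rng.uniform(0.5, 2.0, size=P)              # made-up error weightings d_p
D = np.diag(np.sqrt(d))
D2 = D @ D
one = np.ones((P, 1))

# Weighted means and mean-shifted matrices.
t_bar = T @ D2 @ one / (one.T @ D2 @ one)
m_H = H @ D2 @ one / (one.T @ D2 @ one)
Tt = T - t_bar @ one.T
Ht = H - m_H @ one.T

# Optimum output weights and biases, then the error evaluated directly.
Lam = Tt @ D @ np.linalg.pinv(Ht @ D)
lam0 = t_bar - Lam @ m_H
E_direct = np.linalg.norm((T - (lam0 @ one.T + Lam @ H)) @ D) ** 2 / P

# The same error from the trace form, using S_T and S_B.
S_T = Ht @ D2 @ Ht.T / P
S_B = Ht @ D2 @ Tt.T @ Tt @ D2 @ Ht.T / P**2
J = np.trace(S_B @ np.linalg.pinv(S_T))
E_trace = np.trace(Tt @ D2 @ Tt.T) / P - J
print(E_direct, E_trace)
```

The first term of the trace form depends only on the targets and error weightings, so any change of the hidden layer that lowers the error must raise `J`.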
B Sum Rules


This  appendix  proves  the  result  that  optimally  trained  networks  with  1-from-c  coding 
schemes  satisfy  the  sum  rule  that  the  general  outputs  of  the  network  for  arbitrary  input 
sum  to  unity. 

Theorem. 


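Consider a network having linear output units. Let the weights associated
with the connections to these units be determined by linear minimum norm
least squares optimisation. Then if there exists an arbitrary linear constraint
of the form

$$
\mathbf{u}' \mathbf{t}_p = \mathbf{u}' \bar{\mathbf{t}} \qquad \forall\, p = 1, 2, \ldots, P
$$

with $\mathbf{u}$ a constant vector, $\mathbf{t}_p$ the prototype target vector for the p-th pattern
and $\bar{\mathbf{t}}$ the mean target vector, then the general output $\mathbf{o}$ of the network also
satisfies:

$$
\mathbf{u}' \mathbf{o} = \mathbf{u}' \bar{\mathbf{t}}
$$

Proof

The general output of the network from Appendix A may be expressed as:

$$
\mathbf{o} = \bar{\mathbf{t}} + \tilde{T} \tilde{H}^+ \left( \mathbf{h} - \mathbf{m}^H \right)
$$

where $\mathbf{h}$ is the vector of outputs of the final hidden layer for a given input pattern.
Therefore

$$
\mathbf{u}' \mathbf{o} = \mathbf{u}' \bar{\mathbf{t}} + \mathbf{u}' \tilde{T} \tilde{H}^+ \left( \mathbf{h} - \mathbf{m}^H \right)
$$

But

$$
\mathbf{u}' \tilde{T} = \mathbf{u}' T - \mathbf{u}' \bar{\mathbf{t}} \mathbf{1}'
$$

By hypothesis, $\mathbf{u}' T = \mathbf{u}' \bar{\mathbf{t}} \mathbf{1}'$, so that $\mathbf{u}' \tilde{T} = 0$.
Therefore

$$
\mathbf{u}' \mathbf{o} = \mathbf{u}' \bar{\mathbf{t}} \qquad \square
$$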

Remark  1:  If  the  set  of  target  vectors  satisfy  several  linear  constraints  simultaneously, 
then  so  will  the  general  network  outputs. 

Remark  2:  If  u  is  a  vector  with  unity  for  each  component  and  the  prototype  target 
matrix  is  the  identity  matrix  (1-from-c  coding)  then  the  general  network  output  sums  to 
unity. 
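
The sum rule of Remark 2 can be checked numerically. The sketch below uses unit error weightings and random stand-in hidden outputs rather than a trained network. It fits the output layer by the pseudo-inverse and then evaluates the output for a hidden-layer vector far from the training data.

```python
import numpy as np

rng = np.random.default_rng(3)
c, n0, P = 3, 4, 40

labels = rng.integers(0, c, size=P)
T = np.eye(c)[:, labels]                        # 1-from-c targets, c x P
H = np.tanh(rng.normal(size=(n0, P)))           # made-up hidden outputs, n0 x P

t_bar = T.mean(axis=1, keepdims=True)           # mean target vector
m_H = H.mean(axis=1, keepdims=True)             # mean hidden-layer pattern
Lam = (T - t_bar) @ np.linalg.pinv(H - m_H)     # minimum-norm least squares

def output(h):
    """Network output for an arbitrary hidden-layer vector h (n0 x 1)."""
    return t_bar + Lam @ (h - m_H)

h_new = rng.normal(size=(n0, 1)) * 10.0         # far from the training data
o = output(h_new)
print(o.ravel(), o.sum())
```

The individual outputs may lie outside the interval [0, 1], but their sum stays at unity for any input.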
C Standard Classifiers.


This  appendix  gives  a  brief  discussion  of  the  standard  classifiers  employed  in  this  paper. 
The techniques employed were:


•  Euclidean  distance  to  the  class  mean  (DCM).  For  each  pattern  choose  the  class  which 
has  the  closest  average  pattern  vector  as  determined  on  the  training  set. 

• Gaussian classifier (GC). Assume each pattern in each class is drawn from a full
Gaussian distribution where each class may be characterised by a mean vector, $\boldsymbol{\mu}_i$, and
a (full) covariance matrix, $\Sigma_i$. These are approximated by the training set samples.
Then given the i-th class, $c_i$, the probability density of any given n-dimensional pattern
$\mathbf{p}$ belonging to that class may be expressed as

$$
P(\mathbf{p} \mid c_i) = (2\pi)^{-n/2} \left| \Sigma_i \right|^{-1/2}
  \exp\left\{ -\tfrac{1}{2} (\mathbf{p} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{p} - \boldsymbol{\mu}_i) \right\}
$$

Knowing the prior probabilities $p_i$ of the occurrence of each class allows the
determination of the probability of that class given the pattern, which is proportional to
$p_i P(\mathbf{p} \mid c_i)$. Thus, the decision is to choose that class which gives the largest
pattern conditional probability.

• Nearest neighbour (NN): For each pattern in a given set, choose the class which is
associated to the (Euclidean) nearest pattern in the training set.

• K-nearest neighbour (KNN): For each pattern in a given set, examine the classes
associated with the K nearest patterns in the training set and choose the class which
occurs most often. Ties are broken by choosing the nearest class mean using the K
nearest neighbours. The optimum value of K is chosen as the one which gives best
performance on the test set and is therefore a biased estimate of the model order.
This technique usually gives good results but is computationally expensive on test.

• Optimum Linear Transformation (OLT). For the $n \times P$ matrix $X$ of input patterns
and the corresponding $c \times P$ matrix $T$ of target patterns on the training set, find the
optimum $c \times n$ matrix $A$ and $c \times 1$ vector $\mathbf{b}$ which satisfy the equation

$$
A X + \mathbf{b} \mathbf{1}' = T
$$

with minimum residual error. The solution with minimum Frobenius norm may
be found by pseudo-inverse methods. Once $A$ and $\mathbf{b}$ have been determined, an
arbitrary input pattern is linearly transformed into a pattern $\mathbf{t}$ in the target pattern
space. The class associated with that input pattern is determined by the closest
target vector to the transformed pattern, $\mathbf{t}$.

•  Statistical  Independence  (SI).  This  model  was  discussed  in  the  text  (14).  It  assumes 
that  the  density  estimates  of  each  feature  are  independent  and  so  the  conditional 
probability  density  is  given  by  the  product  of  the  marginal  densities.  The  class 
associated  with  a  pattern  is  determined  from  the  largest  class-conditional  density. 
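
Three of the classifiers listed above can be sketched compactly. The example below is an illustration on made-up, well-separated two-dimensional Gaussian classes, not the medical data. The K-nearest-neighbour version breaks ties by taking the lowest class index, a simplification of the nearest-class-mean tie-break described in the text, and the linear transformation fits $A$ and $\mathbf{b}$ jointly with a single pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c, per_class = 2, 3, 50
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])   # made-up class centres

def sample():
    X = np.vstack([rng.normal(m, 1.0, size=(per_class, n)) for m in means])
    y = np.repeat(np.arange(c), per_class)
    return X, y

X_train, y_train = sample()
X_test, y_test = sample()

def dcm_predict(X):
    """Euclidean distance to the class mean (DCM)."""
    class_means = np.array([X_train[y_train == k].mean(axis=0) for k in range(c)])
    d = ((X[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def knn_predict(X, K=1):
    """K-nearest neighbour by majority vote (K=1 gives nearest neighbour).
    Ties fall to the lowest class index, a simplification."""
    d = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :K]
    votes = np.array([np.bincount(y_train[row], minlength=c) for row in nearest])
    return votes.argmax(axis=1)

def olt_predict(X):
    """Optimum linear transformation A X + b 1' = T with 1-from-c targets."""
    T = np.eye(c)[:, y_train]                             # c x P targets
    X1 = np.vstack([X_train.T, np.ones(len(X_train))])    # inputs plus bias row
    Ab = T @ np.linalg.pinv(X1)                           # [A | b] by pseudo-inverse
    t = Ab @ np.vstack([X.T, np.ones(len(X))])            # transformed patterns
    d = ((t[:, None, :] - np.eye(c)[:, :, None]) ** 2).sum(axis=0)
    return d.argmin(axis=0)                               # closest target vector

accuracy = {name: (f(X_test) == y_test).mean()
            for name, f in [("DCM", dcm_predict),
                            ("NN", knn_predict),
                            ("KNN", lambda X: knn_predict(X, K=5)),
                            ("OLT", olt_predict)]}
print(accuracy)
```

On data this well separated all four decision rules should agree on almost every pattern; the differences between them only become visible on overlapping classes such as those in the medical data.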